Harshawardhan Nene, BFS Innovations, harshawardhan.nene@cognizant.com
[PRIMARY CONTACT]
MS Excel was used for quick preliminary
visualizations.
TGraph2010 (http://sourceforge.net/projects/turtlegraph/), a turtle graphics
program written in Java, was used for early experiments.
This tool had no documentation and many limitations
for the task at hand. Nevertheless, it helped me to quickly try out a number of
ideas.
The final visualizations were generated using Processing (http://processing.org/), an open source "programming language and IDE built for the
electronic arts and visual design communities".
Processing is an impressive tool for creating
visualizations and has good online documentation along with a decent set of
tutorials. However, being new to the tool, I had little time to use any of its
advanced features.
Thus, most of the data was manipulated in Java using
the NetBeans
6.8 IDE (http://netbeans.org/). The Java program output a set of basic Processing drawing instructions, which Processing then ran to generate the visualization.
Inkscape (http://www.inkscape.org/) was used for adding captions to the images. The visualizations
themselves are generated entirely automatically. Only the explanatory text captions were added
manually.
Video:
ANSWERS:
MC3.1: What is the region or country of origin for the current
outbreak? Please provide your answer as
the name of the native viral strain along with a brief explanation.
Answer: Nigeria_B.
The skyline visualization takes the most frequent base at each
column across the 58 current-outbreak strains. The resulting sequence is then
compared to each native sequence. Longer streaks of matching nucleotides result
in larger squares. A mutation is marked by a white line, and a new square starts
growing with the next streak. The white lines help identify regions where
mutations are concentrated. Thus, fewer and larger squares in a skyline indicate
fewer mutations. The length of each skyline is 1404 pixels, corresponding to
the 1404 nucleotides. This retains positional information and allows
comparisons between strains at any given column. The number at the bottom left
of each square indicates the length of the streak.
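The skyline construction described above can be sketched roughly as follows. This is a minimal Python illustration, not the Java program used for the submission; the function names, the tie-breaking behaviour when two bases are equally frequent, and the layout of the emitted Processing rect() calls (squares standing on a hypothetical baseline) are all assumptions.

```python
from collections import Counter


def consensus(strains):
    """Most frequent base at each column across equal-length strains.

    Ties are broken by first occurrence in the column (an assumption;
    the source does not say how ties were handled).
    """
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*strains))


def streaks(reference, native):
    """Return (start_column, length) for each run of matching bases.

    Columns are 1-based. A mismatching column ends the current run
    and is not part of any streak.
    """
    runs, start, i = [], None, 0
    for i, (a, b) in enumerate(zip(reference, native), start=1):
        if a == b:
            if start is None:
                start = i
        elif start is not None:
            runs.append((start, i - start))
            start = None
    if start is not None:
        # Close the run that reaches the last compared column.
        runs.append((start, i - start + 1))
    return runs


def to_processing(runs, baseline=200):
    """Emit one Processing rect() call per streak.

    Each streak becomes a square whose side equals the streak length,
    standing on a hypothetical baseline (assumed layout).
    """
    return [
        f"rect({start - 1}, {baseline - length}, {length}, {length});"
        for start, length in runs
    ]
```

For two toy sequences, streaks("ACGA", "ACTA") yields a streak of length 2 starting at column 1 and a streak of length 1 starting at column 4, with the mismatch at column 3 falling between them.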
MC3.2: Over time, the virus spreads and the diversity of the virus
increases as it mutates. Two patients
infected with the Drafa virus are in the same
hospital as Nicolai. Nicolai has a
strain identified by sequence 583. One
patient has a strain identified by sequence 123 and the other has a strain
identified by sequence 51. Assume only a
single viral strain is in each patient.
Which patient likely contracted the illness from Nicolai and why? Please provide your answer as the sequence
number along with a brief explanation.
Answer: 123. Using the skyline view to compare
sequences 123 and 51 with Nicolai’s strain, we observe that sequence 123 is
closer to that of Nicolai. Strain 51 has two additional mutations (at columns
842 and 946). The nucleotides observed at these columns are more evenly distributed
than at most other columns (T in 41 strains and C in 17 at column 842; A in 44
and T in 14 at column 946). Having different
nucleotides at these columns further reduces the likelihood that the patient
with strain 51 acquired the illness from Nicolai.
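The comparison above amounts to counting the columns at which each candidate differs from Nicolai's strain. A minimal Python sketch, using short made-up sequences rather than the real 1404-base data, and with hypothetical function names:

```python
def differing_columns(a, b):
    """1-based columns at which two aligned sequences differ."""
    return [i for i, (x, y) in enumerate(zip(a, b), start=1) if x != y]


def closest(source, candidates):
    """Id of the candidate sequence with the fewest differences from source.

    candidates maps a sequence id to its aligned sequence.
    """
    return min(
        candidates,
        key=lambda seq_id: len(differing_columns(source, candidates[seq_id])),
    )
```

This only counts differences; it does not weigh them by how evenly the nucleotides are distributed at each column, which the answer above uses as supporting evidence.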
MC3.3: Signs and symptoms of the Drafa
virus are varied and humans react differently to infection. Some mutant strains from the current outbreak
have been reported as being worse than others for the patients that come in
contact with them.
Identify the top 3 mutations that lead to an increase in symptom
severity (a disease characteristic). The
mutations involve one or more base substitutions. For this question, the biological properties
of the underlying amino acid sequence patterns are not significant in
determining disease characteristics.
For each mutation provide the base substitutions and their position in
the sequence (left to right) where the base substitutions occurred. For
example,
C → G, 456 (C changed to G at position 456)
G → A, 513 and T → A, 907 (G changed to A at position 513
and T changed to A at position 907)
A → G, 39 (A changed to G at position 39)
Answer: A→C 269
G→C 212
A→T 946
Nearly 95% of the columns have the same nucleotide across all 58
strains. Only columns in which the less common nucleotide occurs in at least 5%
of the strains (a 95% / 5% split or more even) were
considered for this question and the next. These columns are arranged in
order, from the leftmost column (161), which has a nearly even distribution, to
the rightmost column (821), which has the most skewed distribution. The larger
groups appear in the top row and the smaller groups in the bottom row. We
assume that the smaller groups are the mutations. Each characteristic was
simplified to severe / not severe, and the percentage of severe cases
was calculated for each group in each column. The higher of the two percentages
for a given characteristic is highlighted in red. Bars show the
absolute difference between the two percentages. The
brighter blue bars mark the top 3 mutations.
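The column selection and the per-group severity percentages can be sketched as below. This is an illustrative Python version under stated assumptions: the 5% threshold is applied to the second most common base, columns with more than two bases are reduced to their two most common ones, and the function names and toy data are hypothetical.

```python
from collections import Counter


def informative_columns(strains, min_minor=0.05):
    """1-based columns where the second most common base reaches min_minor."""
    cols = []
    for i, col in enumerate(zip(*strains), start=1):
        counts = Counter(col).most_common()
        if len(counts) > 1 and counts[1][1] / len(col) >= min_minor:
            cols.append(i)
    return cols


def severity_gap(strains, severe, column):
    """Compare the share of severe cases between the two groups of a column.

    severe holds one boolean per strain. The column must contain at least
    two distinct bases. Returns
    (major_base, minor_base, pct_major, pct_minor, abs_diff).
    """
    bases = [s[column - 1] for s in strains]
    (major, _), (minor, _) = Counter(bases).most_common(2)

    def pct(base):
        flags = [sev for b, sev in zip(bases, severe) if b == base]
        return 100.0 * sum(flags) / len(flags)

    p_major, p_minor = pct(major), pct(minor)
    return major, minor, p_major, p_minor, abs(p_major - p_minor)
```

The minor base is treated as the mutation, matching the assumption stated above that the smaller group in each column is the mutant group.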
MC3.4: Due to the rapid spread of the virus and limited resources,
medical personnel would like to focus on treatments and quarantine procedures
for the worst of the mutant strains from the current outbreak, not just symptoms
as in the previous question. To find the
most dangerous viral mutants, experts are monitoring multiple disease
characteristics.
Consider each virulence and drug resistance characteristic as equally
important. Identify the top 3 mutations
that lead to the most dangerous viral strains. The mutations involve one or
more base substitutions. In a worst case
scenario, a very dangerous strain could cause severe symptoms, have high
mortality, cause major complications, exhibit resistance to anti viral drugs, and
target high risk groups. For this
question, the biological properties of the underlying amino acid sequence
patterns are not significant in determining disease characteristics.
For each mutation provide the base substitutions and their position in
the sequence (left to right) where the base substitutions occurred. For
example,
C → G, 456 (C changed to G at position 456)
G → A, 513 and T → A, 907 (G changed to A at position 513
and T changed to A at position 907)
A → G, 39 (A changed to G at position 39).
Answer: A→C 269
G→C 223
A→T 946
For each column, the absolute differences are summed over the
characteristics that are marked red. For example, the share of strains with severe symptoms is 88 percentage points higher if
column 269 has a C instead of a T. At column 269 the severity is higher for symptoms,
mortality, drug resistance and vulnerability, so 82 + 88 + 36 + 30 gives a total of
236. The three largest sums are highlighted and answer the question.
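The scoring step can be written as a short Python sketch. It assumes that "marked red" means the mutant (smaller) group has the higher percentage for that characteristic; the function names are hypothetical, and only the four differences for column 269 come from the text above.

```python
def danger_score(gaps):
    """Sum the absolute differences for characteristics where the mutant is worse.

    gaps is a list of (abs_diff, mutant_is_worse) pairs, one per
    characteristic; pairs with mutant_is_worse == False are ignored.
    """
    return sum(diff for diff, mutant_is_worse in gaps if mutant_is_worse)


def top_columns(scores, n=3):
    """Columns with the n largest scores, highest first."""
    return sorted(scores, key=scores.get, reverse=True)[:n]


# Column 269: the four differences quoted in the text, all counted.
column_269 = [(82, True), (88, True), (36, True), (30, True)]
print(danger_score(column_269))  # 236
```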